
    Word Embedding Evaluation in Downstream Tasks and Semantic Analogies

    Language models have long been a prolific area of study in Natural Language Processing (NLP). Among the newer and most widely used kinds are Word Embeddings (WE): vector-space representations of a vocabulary learned by an unsupervised neural network from the contexts in which words appear. WE have been widely used in downstream tasks across many areas of NLP, which typically use these vector models as features when processing textual data. This paper presents the evaluation of newly released WE models for the Portuguese language, trained on a corpus of 4.9 billion tokens. The first evaluation is an intrinsic task in which the WEs had to correctly build semantic and syntactic relations. The second is an extrinsic evaluation in which the WE models were used in two downstream tasks: Named Entity Recognition and Semantic Similarity between Sentences. Our results show that a diverse and comprehensive corpus can often outperform a larger, less textually diverse one, and that passing the text in parts to the WE-generating algorithm may cause a loss of quality.
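The intrinsic analogy task mentioned above can be illustrated with a toy sketch: an analogy "homem : rei :: mulher : ?" is solved by vector arithmetic and nearest-neighbour search. The vocabulary and vectors below are invented for the example; real WE models learn hundreds of dimensions from billions of tokens.

```python
from math import sqrt

# Tiny hand-crafted "embeddings" (illustrative only).
emb = {
    "rei":    [1.0, 1.0, 0.0],   # king
    "homem":  [1.0, 0.0, 0.0],   # man
    "mulher": [0.0, 1.0, 0.0],   # woman
    "rainha": [0.0, 2.0, 0.0],   # queen
    "banana": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Solve a : b :: c : ? as the nearest neighbour of (b - a + c)."""
    target = [bi - ai + ci for ai, bi, ci in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("homem", "rei", "mulher"))  # → rainha
```

Benchmarks of this kind score a model by the fraction of such analogies it completes correctly over a large test set.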

    Embeddings for Named Entity Recognition in Geoscience Portuguese Literature

    This work focuses on Portuguese Named Entity Recognition (NER) in the geology domain. The only domain-specific Portuguese dataset annotated for NER is the GeoCorpus. Our approach relies on Bidirectional Long Short-Term Memory - Conditional Random Fields (BiLSTM-CRF) neural networks, a widely used architecture in this area of research, fed with vector and tensor embedding representations. We used three types of embedding models (Word Embeddings, Flair Embeddings, and Stacked Embeddings) in two versions (domain-specific and generalized). We originally trained the domain-specific Flair Embeddings model with a generalized context in mind, then fine-tuned it with domain-specific oil and gas corpora, as there simply was not enough domain text to train such a model from scratch. We evaluated each of these embeddings separately, as well as stacked with another embedding. We achieved state-of-the-art results for this domain with one of our embeddings, and performed an error analysis on the language model that achieved the best results. Furthermore, we investigated the effects of domain-specific versus generalized embeddings. UIDB/00057/2020, CEECIND/01997/201
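The CRF layer of a BiLSTM-CRF can be illustrated in isolation: given per-token emission scores (which the BiLSTM would produce) and tag-transition scores, Viterbi decoding picks the best globally consistent tag sequence. The tags, emission values, and transition scores below are invented for the example.

```python
TAGS = ["O", "B-GEO", "I-GEO"]

# Hypothetical transition scores a trained CRF might learn:
# O -> I-GEO is heavily penalised (an entity cannot start with I-GEO).
TRANS = {
    ("O", "O"): 0.0, ("O", "B-GEO"): 0.0, ("O", "I-GEO"): -10.0,
    ("B-GEO", "O"): 0.0, ("B-GEO", "B-GEO"): 0.0, ("B-GEO", "I-GEO"): 0.5,
    ("I-GEO", "O"): 0.0, ("I-GEO", "B-GEO"): 0.0, ("I-GEO", "I-GEO"): 0.5,
}

def viterbi(emissions):
    """Return the best tag sequence under emission + transition scores."""
    # paths[tag] = (best path ending in tag, its score)
    paths = {t: ([t], e) for t, e in zip(TAGS, emissions[0])}
    for em in emissions[1:]:
        new = {}
        for t, e in zip(TAGS, em):
            prev = max(paths, key=lambda p: paths[p][1] + TRANS[(p, t)])
            path, score = paths[prev]
            new[t] = (path + [t], score + TRANS[(prev, t)] + e)
        paths = new
    return max(paths.values(), key=lambda v: v[1])[0]

# Per-token scores over (O, B-GEO, I-GEO). Greedy decoding would pick
# O then I-GEO; the transition penalty forces B-GEO I-GEO instead.
emissions = [(0.6, 0.5, 0.0), (0.2, 0.0, 0.9)]
print(viterbi(emissions))  # → ['B-GEO', 'I-GEO']
```

This global decoding is why a CRF on top of the BiLSTM typically beats per-token softmax classification for NER.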

    Análise da capacidade de identificação de paráfrase em ferramentas de resolução de correferência (Analysis of the paraphrase identification capability of coreference resolution tools)

    The linguistic phenomena of coreference and paraphrase share certain aspects. It is common, for example, to refer to the same entity in different ways within the same context; coreference resolution can therefore assist the process of paraphrase identification. This article presents an analysis of the capabilities of CORP, a coreference resolution tool for Portuguese, in the context of paraphrase identification at the sentence and phrase levels.
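A minimal sketch of the underlying idea: resolving coreferent mentions to a canonical entity before comparing sentences raises the lexical overlap between paraphrases. The coreference chains, sentences, and Jaccard measure below are invented for the illustration; CORP's actual pipeline is more sophisticated.

```python
# Hypothetical coreference chains mapping mentions to one canonical entity.
CHAINS = {"Ele": "Lula"}

def normalize(sentence):
    """Replace coreferent mentions with their canonical entity."""
    for mention, entity in CHAINS.items():
        sentence = sentence.replace(mention, entity)
    return sentence

def jaccard(a, b):
    """Token-set overlap between two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

s1 = "Lula viajou ontem"
s2 = "Ele viajou ontem"
print(jaccard(s1, s2))                        # low surface overlap: 0.5
print(jaccard(normalize(s1), normalize(s2)))  # full overlap: 1.0
```

After resolution the two sentences become lexically identical, which a paraphrase detector can exploit.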

    Enriching Portuguese Word Embeddings with Visual Information

    This work focuses on enriching existing Portuguese word embeddings with visual information in the form of visual embeddings. This information was extracted from images portraying given vocabulary terms, with "imagined" visual embeddings learned for terms that have no image data. These enriched embeddings were tested against their text-only counterparts on common NLP tasks. The results show a performance increase on several tasks, which indicates that fusing visual information into word embeddings can be useful for embedding-based NLP tasks. FCT CEECIND/01997/2017, UIDB/00057/202
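One common fusion strategy for this kind of enrichment is to concatenate L2-normalised text and visual vectors into a single multimodal embedding. The paper's exact fusion method is not specified here, so treat this as an assumption rather than the authors' implementation.

```python
from math import sqrt

def l2_normalize(v):
    """Scale a vector to unit length so neither modality dominates."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse(text_vec, visual_vec):
    """Concatenate L2-normalised text and visual embeddings."""
    return l2_normalize(text_vec) + l2_normalize(visual_vec)

# A 2-d text vector fused with a 3-d visual vector yields a 5-d embedding.
fused = fuse([3.0, 4.0], [0.0, 1.0, 0.0])
print(fused)  # → [0.6, 0.8, 0.0, 1.0, 0.0]
```

For terms without images, an "imagined" visual vector (predicted from the text vector by a learned mapping) would be passed as `visual_vec` instead.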

    Benchmarking the BRATECA Clinical Data Collection for Prediction Tasks

    Expanding the usability of location-specific clinical datasets is an important step toward expanding research into national medical issues, rather than only attempting to generalize hypotheses from foreign data. Benchmarking such datasets, thus proving their usefulness for certain kinds of research, is therefore a worthwhile task. This paper presents the first results of widely used prediction tasks on data from the BRATECA collection, a Brazilian tertiary care data collection, along with results for neural network architectures using these newly created test sets. The architectures use both structured and unstructured data to achieve their results. The obtained results are expected to serve as benchmarks for future tests with more advanced models based on the data available in BRATECA.

    BRATECA (Brazilian Tertiary Care Dataset): a Clinical Information Dataset for the Portuguese Language

    Computational medicine research requires clinical data for training and testing purposes, so the development of datasets composed of real hospital data is of utmost importance in this field. Most such data collections are in the English language, were collected in anglophone countries, and do not reflect other clinical realities, which increases the importance of national datasets for projects that hope to positively impact public health. This paper presents a new Brazilian clinical dataset containing over 70,000 admissions from 10 hospitals in two Brazilian states, comprising over 2.5 million free-text clinical notes alongside data pertaining to patient information, prescription information, and exam results. This data was collected, organized, and deidentified, and is being distributed via credentialed access for use by the research community. In presenting the new dataset, this paper explores its structure, its population, and the potential benefits of using it in clinical AI tasks. FCT UIDB/00057/2020, CEECIND/01997/201
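Deidentification of free-text clinical notes is often rule-based, at least as a first pass. The sketch below is a hypothetical illustration with invented patterns (Brazilian CPF numbers, dates, clinician names); the dataset's actual pipeline is not described in this abstract.

```python
import re

# Hypothetical rule-based de-identification patterns for Portuguese notes.
PATTERNS = [
    (re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"), "[CPF]"),    # Brazilian tax ID
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\bDra?\.\s+[A-ZÀ-Ú]\w+\b"), "[DOCTOR]"),      # "Dr./Dra. Name"
]

def deidentify(note):
    """Replace each protected-information match with a placeholder."""
    for pattern, placeholder in PATTERNS:
        note = pattern.sub(placeholder, note)
    return note

print(deidentify("Paciente avaliado por Dra. Silva em 12/03/2020, CPF 123.456.789-00."))
# → Paciente avaliado por [DOCTOR] em [DATE], CPF [CPF].
```

Production pipelines typically combine such rules with learned NER models and manual auditing before credentialed release.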

    Portuguese word embeddings for the oil and gas industry: Development and evaluation

    Over the last few decades, oil and gas companies have faced a continuous increase in data collected in unstructured textual format. New disruptive technologies, such as natural language processing and machine learning, present an unprecedented opportunity to extract a wealth of valuable information from these documents. Word embedding models are one of the most fundamental units of natural language processing: by providing meaningful representations of words that capture syntactic and semantic features from context, they enable machine learning algorithms to achieve strong generalization. However, the domain-specific vocabulary of oil and gas poses a challenge to these algorithms, since words may assume a completely different meaning from their common understanding. The Brazilian pre-salt is an important exploratory frontier for the oil and gas industry, increasingly attractive for international investment in exploration and production projects, and most of its documentation is in Portuguese. Moreover, Portuguese is one of the most widely spoken languages by number of native speakers. Nonetheless, despite the importance of the petroleum sector in Portuguese-speaking countries, specialized public corpora in this domain are scarce. This work proposes PetroVec, a representative set of word embedding models for the oil and gas domain in Portuguese. We gathered an extensive collection of domain-related documents from leading institutions to build a large specialized oil and gas corpus in Portuguese, comprising more than 85 million tokens. For intrinsic evaluation, assessing how well the models encode domain semantics, we created a semantic relatedness test set comprising 1,500 word pairs labeled by selected experts in geoscience and petroleum engineering from both academia and industry. In addition, we performed an extrinsic quantitative evaluation on a downstream task of named entity recognition in geoscience, plus a set of qualitative analyses, and conducted a comparative evaluation against a public general-domain embedding model. The obtained results suggest that our domain-specific models outperform the general model in their ability to represent specialized terminology. To the best of our knowledge, this is the first attempt to generate and evaluate word embedding models for the oil and gas domain in Portuguese. Finally, all resources developed in this work are made available for public use, including the pre-trained specialized models, corpora, and validation datasets. FCT CEECIND/01997/201
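Intrinsic relatedness evaluations of this kind typically correlate model cosine similarities with expert scores, commonly via Spearman's rank correlation. The word pairs and scores below are invented for the example (tie handling omitted for brevity).

```python
def spearman(xs, ys):
    """Spearman rank correlation (no tie handling, for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented pairs: expert relatedness scores vs. model cosine similarities,
# e.g. (poço, perfuração), (poço, banana), (rocha, bacia).
expert = [4.8, 1.2, 3.5]
model = [0.91, 0.10, 0.55]
print(spearman(expert, model))  # → 1.0 (rankings agree perfectly)
```

A high correlation over the 1,500 labeled pairs indicates that the embedding space orders word relatedness the same way the domain experts do.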